Improved Bilinear Pooling with CNNs

نویسندگان

Tsung-Yu Lin

Subhransu Maji

چکیده

Bilinear pooling of Convolutional Neural Network (CNN) features [22, 23], and their compact variants [10], have been shown to be effective at fine-grained recognition, scene categorization, texture recognition, and visual question-answering tasks among others. The resulting representation captures second-order statistics of convolutional features in a translationally invariant manner. In this paper we investigate various ways of normalizing these statistics to improve their representation power. In particular we find that the matrix square-root normalization offers significant improvements and outperforms alternative schemes such as the matrix logarithm normalization when combined with elementwise square-root and `2 normalization. This improves the accuracy by 2-3% on a range of fine-grained recognition datasets leading to a new state of the art. We also investigate how the accuracy of matrix function computations effect network training and evaluation. In particular we compare against a technique for estimating matrix square-root gradients via solving a Lyapunov equation that is more numerically accurate than computing gradients via a Singular Value Decomposition (SVD). We find that while SVD gradients are numerically inaccurate the overall effect on the final accuracy is negligible once boundary cases are handled carefully. We present an alternative scheme for computing gradients that is faster and yet it offers improvements over the baseline model. Finally we show that the matrix square-root computed approximately using a few Newton iterations is just as accurate for the classification task but allows an order-of-magnitude faster GPU implementation compared to SVD decomposition.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Where to Focus: Deep Attention-based Spatially Recurrent Bilinear Networks for Fine-Grained Visual Recognition

Fine-grained visual recognition typically depends on modeling subtle difference from object parts. However, these parts often exhibit dramatic visual variations such as occlusions, viewpoints, and spatial transformations, making it hard to detect. In this paper, we present a novel attention-based model to automatically, selectively and accurately focus on critical object regions with higher imp...

متن کامل

Chimpanzee Faces in the Wild: Log-Euclidean CNNs for Predicting Identities and Attributes of Primates

In this paper, we investigate how to predict attributes of chimpanzees such as identity, age, age group, and gender. We build on convolutional neural networks, which lead to significantly superior results compared with previous state-of-the-art on hand-crafted recognition pipelines. In addition, we show how to further increase discrimination abilities of CNN activations by the Log-Euclidean fra...

متن کامل

Learning Convolutional Neural Networks using Hybrid Orthogonal Projection and Estimation

Convolutional neural networks (CNNs) have yielded the excellent performance in a variety of computer vision tasks, where CNNs typically adopt a similar structure consisting of convolution layers, pooling layers and fully connected layers. In this paper, we propose to apply a novel method, namely Hybrid Orthogonal Projection and Estimation (HOPE), to CNNs in order to introduce orthogonality into...

متن کامل

Beyond Counting: Comparisons of Density Maps for Crowd Analysis Tasks - Counting, Detection, and Tracking

For crowded scenes, the accuracy of object-based computer vision methods declines when the images are lowresolution and objects have severe occlusions. Taking counting methods for example, almost all the recent state-of-the-art counting methods bypass explicit detection and adopt regressionbased methods to directly count the objects of interest. Among regression-based methods, density map estim...

متن کامل

MoNet: Moments Embedding Network

Bilinear pooling has been recently proposed as a feature encoding layer, which can be used after the convolutional layers of a deep network, to improve performance in multiple vision tasks. Different from conventional global average pooling or fully connected layer, bilinear pooling gathers 2nd order information in a translation invariant fashion. However, a serious drawback of this family of p...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

CoRR

دوره abs/1707.06772 شماره

صفحات -

تاریخ انتشار 2017

Improved Bilinear Pooling with CNNs

نویسندگان

چکیده

منابع مشابه

Where to Focus: Deep Attention-based Spatially Recurrent Bilinear Networks for Fine-Grained Visual Recognition

Chimpanzee Faces in the Wild: Log-Euclidean CNNs for Predicting Identities and Attributes of Primates

Learning Convolutional Neural Networks using Hybrid Orthogonal Projection and Estimation

Beyond Counting: Comparisons of Density Maps for Crowd Analysis Tasks - Counting, Detection, and Tracking

MoNet: Moments Embedding Network

عنوان ژورنال:

اشتراک گذاری